{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1882c90a",
   "metadata": {},
   "source": [
    "\n",
    "# Survival Analysis in Biomedical Research\n",
    "\n",
    "This notebook demonstrates survival analysis techniques, including Kaplan-Meier curves, log-rank tests, and Cox proportional hazards regression, using Python libraries such as `scikit-survival` and `lifelines`.\n",
    "\n",
    "## Dataset Information and Links\n",
    "\n",
    "1. **Veterans Lung Cancer Dataset:**\n",
    "   - [Veterans Lung Cancer Dataset](https://github.com/sebp/scikit-survival/blob/v0.22.2/sksurv/datasets/data/veteran.arff)\n",
    "   - Provided by the `scikit-survival` package.\n",
    "\n",
    "2. **Breast Cancer Dataset:**\n",
    "   - Comes bundled with the `scikit-survival` package.\n",
    "\n",
    "### Prerequisites\n",
    "\n",
    "Before running the code, install the required packages:\n",
    "```bash\n",
    "pip install scikit-survival lifelines matplotlib numpy pandas\n",
    "```\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4acae5e9",
   "metadata": {},
   "source": [
    "\n",
    "## Kaplan-Meier Curve: Veterans Lung Cancer Dataset\n",
    "\n",
    "The Kaplan-Meier curve is used to estimate survival probabilities over time.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6808c32d",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Import necessary libraries\n",
    "from sksurv.datasets import load_veterans_lung_cancer\n",
    "from sksurv.nonparametric import kaplan_meier_estimator\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Load the veterans lung cancer dataset\n",
    "data_x, data_y = load_veterans_lung_cancer()\n",
    "\n",
    "# Kaplan-Meier survival estimates\n",
    "time, survival_prob = kaplan_meier_estimator(data_y[\"Status\"], data_y[\"Survival_in_days\"])\n",
    "\n",
    "# Plot the Kaplan-Meier curve\n",
    "plt.step(time, survival_prob, where=\"post\")\n",
    "plt.ylabel(\"Estimated probability of survival\")\n",
    "plt.xlabel(\"Time (days)\")\n",
    "plt.title(\"Kaplan-Meier Curve: Veterans Lung Cancer\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "89049e3c",
   "metadata": {},
   "source": [
    "\n",
    "## Log-Rank Test: Comparing Treatment Groups\n",
    "\n",
    "The log-rank test evaluates whether there are significant differences between survival curves of two groups.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ea0db7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sksurv.compare import compare_survival\n",
    "\n",
    "# Create treatment group indicator\n",
    "treatment = data_x[\"Treatment\"] == \"test\"\n",
    "\n",
    "# Log-rank test\n",
    "chi2, pvalue = compare_survival(data_y, data_x[\"Treatment\"])\n",
    "print(f\"Chi-squared value: {chi2}\")\n",
    "print(f\"P-value: {pvalue}\")\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18e5989f",
   "metadata": {},
   "source": [
    "\n",
    "## Kaplan-Meier Curves by Treatment Group\n",
    "\n",
    "Visualizing survival probabilities for different treatment groups.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10268b43",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Separate Kaplan-Meier estimates by group\n",
    "time_treatment, survival_prob_treatment = kaplan_meier_estimator(\n",
    "    data_y[\"Status\"][treatment], data_y[\"Survival_in_days\"][treatment]\n",
    ")\n",
    "time_control, survival_prob_control = kaplan_meier_estimator(\n",
    "    data_y[\"Status\"][~treatment], data_y[\"Survival_in_days\"][~treatment]\n",
    ")\n",
    "\n",
    "# Plot Kaplan-Meier curves\n",
    "plt.step(time_treatment, survival_prob_treatment, where=\"post\", label=\"Test Treatment\")\n",
    "plt.step(time_control, survival_prob_control, where=\"post\", label=\"Standard Treatment\")\n",
    "plt.ylabel(\"Estimated probability of survival\")\n",
    "plt.xlabel(\"Time (days)\")\n",
    "plt.legend()\n",
    "plt.title(\"Kaplan-Meier Curves by Treatment Group\")\n",
    "plt.show()\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fcc3a228",
   "metadata": {},
   "source": [
    "\n",
    "## Cox Proportional Hazards Regression: Breast Cancer Dataset\n",
    "\n",
    "The Cox regression model evaluates the relationship between survival time and covariates.\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26fcdf7e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from sksurv.datasets import load_breast_cancer\n",
    "from lifelines import CoxPHFitter\n",
    "import pandas as pd\n",
    "\n",
    "# Load the breast cancer dataset\n",
    "data_x, data_y = load_breast_cancer()\n",
    "\n",
    "# Prepare data for Cox regression\n",
    "data_x['size_group'] = (data_x['size'] >= 2).astype(int)\n",
    "data_x['er_positive'] = data_x['er'] == \"pos\"\n",
    "df = data_x[[\"age\", \"size_group\", \"er_positive\"]].copy()\n",
    "df[\"time\"] = data_y[\"t.tdm\"]\n",
    "df[\"event\"] = data_y[\"e.tdm\"]\n",
    "\n",
    "# Fit Cox proportional hazards model\n",
    "cph = CoxPHFitter()\n",
    "cph.fit(df, duration_col=\"time\", event_col=\"event\")\n",
    "cph.print_summary()\n",
    "        "
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}
